.\" $X$ xroff -mgs $file
.\" $tty$ groff -mgs $file | colcrt - | more
.\" $lpr$ groff -mgs $file > ${file}.lpr
.\" Define a page top that looks cool
.de PT
.if \\n%>1 \{\
.	sp -.1i
.	ps 14
.	ft 3
.	nr big 24
.	nr space \\w'XXX'
.	nr titlewid \\w'\\*[title]'
.	nr barwid (\\n[LL]-(\\n[titlewid]+(2*\\n[space])))/2
.	ds ln \\l'\\n[barwid]u'\\h'-\\n[barwid]u'\v'-.25'
.	ds bar \\s(\\n[big]\\*(ln\\*(ln\\*(ln\\*(ln\\*(ln\v'1.25'\\h'\\n[barwid]u'\\s0
.	ce 1
\\*[bar]\h'\\n[space]u'\v'-.15'\\*[title]\v'.15'\h'\\n[space]u'\\*[bar]
.	ps
.	sp -.70
.	ps 12
\\l'\\n[LL]u'
.	ft
.	ps
.\}
..
.\" Define a page bottom that looks cool
.de BT
.	ps 9
\v'-1'\\l'\\n(LLu'
.	sp -1
.	tl '\(co 1994 \\*[author]'\\*(DY'%'
.	ps
..
.\" Configuration
.VARPS
.nr HM 1.0i
.nr FM 1i
.if t .nr PO .75i
.if t .nr LL 7.0i
.if n .nr PO .25i
.if n .nr LL 7.5i
.nr PS 11
.nr VS \n(PS+2
.ds title Portable Tools for Performance Analysis
.ds author Larry McVoy
.TL
lmbench:
.sp .5
\*[title]
.br
\s8Revision $Revision: 1.4 $ of $Date: 94/11/23 18:02:12-08:00 $\s0
.AU
\*[author]
.AI
.ps -2
lm@sgi.com\**
(415) 390-1804
.ps +2
.AB
A description of a set benchmarks for measuring system performance.
The benchmarks include latency measurements of basic system operations
such as memory, processes, networking, and disks, and bandwidth measurements
of memory, disks, and networking.
The benchmarks have been run under a wide variety of Unix systems.
The benchmarks are freely distributed under
the GNU General Public License, with the additional restriction 
that results may be reported only if the benchmarks are unmodified.
.AE
.sp 2
.if t .2C
.FS
This work was mostly done while the author was an employee of Sun Microsystems
Computer Corporation.
.FE
.NH 1
Introduction
.LP
The purpose of this project is to provide the computer community with tools
for performance analysis of basic operations of their computer systems.
The tools are designed
to be both portable and comparable over a wide set of Unix systems.\**
.FS
The tools have been run on 
AIX,
BSDI,
HP-UX,
IRIX,
Linux,
NetBSD,
OSF/1,
Solaris,
and
SunOS by the author.
.FE
The interfaces that the tools use have been carefully chosen to be as portable 
and standard as possible.  It is an explicit intent of the benchmark to measure
standard interfaces.  Users of this benchmark may not report results from
modified versions of the benchmarks.\**
.FS
For example, the context switch benchmark may not use a \f(CWyield()\fP
primitive instead of pipes; the networking benchmarks must use the socket
interfaces, not TLI or some other interface.
.FE
.PP
The purpose of
this document is to describe each of the benchmarks.  
.PP
The benchmarks are loosely divided into latency, bandwidth, and ``other''
categories.
.NH 1
Latency measurements
.LP
The latency measurements included in this suite are process creation times
(including address space extension via mmap()),
basic operating system entry cost, context switching, inter process
communication, file system latency, 
disk latency (you must be the super user to get
disk latency results), and memory latency.
.PP
Process benchmarks are used to measure the basic process primitives,
such as creating a new process, running a different program, and context 
switching.  Process creation benchmarks are of particular interest
to distributed systems since many remote operations include the creation
of a remote process to shepherd the remote operation to completion.
Context switching is important for the same reasons.
.PP
Inter process communication latency is important because many operations
are control messages that tell another process (frequently on another
system) to do something.  The latency of telling the remote process to
do something is pure overhead and is frequently in the critical path
of important functions, such as distributed databases.\**
.FS
The performance of the TCP latency benchmark has proven to be a good
estimate of the performance of the Oracle database lock manager.
.FE
.PP
The inter process communication latency benchmarks are roughly the same
idea: pass a small message (a byte or so) back and forth between two
processes.  The reported results are always the microseconds it takes
to do one round trip.  If you are interested in a one way timing, then
about half the round trip is right (however, the CPU cycles tend to be
somewhat asymmetric for a one trip).  
.NH 2
Process forks/exits
.LP
Create a child process which does nothing but 
terminate.  Results are reported in creations per second.  
The benchmark is measuring how fast the OS can create a new address
space and process context.
The child process is spawned via the \f(CBfork\fP() interface,
not the \f(CBvfork\fP() interface. 
.NH 2
Simple process creates I
.LP
Create a child process which then runs a new program that does nothing
but print ``hello world'' and exit.  The difference between this 
benchmark and the previous is the running of a new program.  The
time difference between this and the previous benchmark is the cost
of starting a new (simple) program.  That cost is especially noticeable
on (some) systems that have shared libraries.  Shared libraries can
introduce a substantial (10s of milliseconds) start up cost.  This 
benchmark is intended to quantify the time/space tradeoff of shared
libraries.
.NH 2
Simple process creates II
.LP
Create a child process which runs the same new program except that the
program is started by the system shell.  This is a clone of the C
library \f(CBsystem\fP() interface.  The intent is to educate users
about the cost of this interface.  I have long felt that using the
Bourne shell, especially a dynamically linked Bourne shell, to start up
processes is over kill; perhaps these numbers will convince others of the
same thing.  A better choice would be Plan 9's \f(CBrc\fP shell (which
is, by the way, free software).
.NH 2
Memory mapping 
.LP
Memory mapping is the process of making a file part of a process' address
space, allowing direct access to the file's pages.  It is an alternative
to the traditional read and write interfaces.  Memory mapping is extensively
used for linking in shared libraries at run time.  This benchmark measures
the speed at which mappings can be created as well as removed.  Results
are reported in mappings per second, and the results can be graphed as the
test is run over a series of different sizes.
.NH 2
Context switches
.LP
Measures process context switch time.\**  A context switch is defined as 
the time it takes to save the state of one process and restore the state
of another process.
Typical context switch benchmarks measure just the minimal context switch
time, i.e., the time to switch between two processes that are doing nothing
but context switching.  That approach is misleading because systems may
have multiple active processes and the processes typically have more state
(hot cache lines) than just the code required to force another context
switch.  This benchmark takes that into consideration and varies both
the number and the size of the processes.
.FS
A previous version of this benchmark included several system calls
in addition to the context switch, resulting in grossly over inflated
context switch times.
.FE
.PP
The benchmark is a ring of two to twenty processes that are connected
with Unix pipes.  A token is passed from process to process, forcing
context switches.  The benchmark measures the time it takes to pass
the token two thousand times from process to process.  Each hand off
of the token has two costs: (a) the context switch, and (b) the cost 
of passing the token.  In order to get just the context switching time,
the benchmark first measures the cost of passing the token through a 
ring of pipes in a single process.  This time is defined as the cost 
of passing the token and is not included in the reported context switch
time.
.PP
When the processes are larger than the default baseline of ``zero''
(where zero means just big enough to do the benchmark), the cost
of the context switch includes the cost of restoring user level
state (cache lines).  This is accomplished by having the process
allocate an array of data and sum it as a series of integers
after receiving the token but before passing the token to the
next process.  Note that the overhead mentioned above includes 
the cost of accessing the data but because it is measured in 
just one address space, the cost is typically the cost with hot
caches.  So the context switch time does not include anything
other than the context switch provided that all the processes
fit in the cache.  If there are cache misses (as is common), the
cost of the context switch includes the cost of those cache misses.
.PP
Results for an HP system running at 100 mhz are shown below.  
This is a particularly nice system for this benchmark because the
results are quite close to what is expected from a machine with a
256KB cache.  As the size and number of processes are both increased,
processes start falling out of the cache, resulting in higher context
switch times.
.LP
.so ctx.pic
.NH 2
Null system calls
.LP
Measures the cost of entering and exiting (without pausing) the
operating system.  This is accomplished by repeatedly writing one byte
to \f(CB/dev/null\fP, a pseudo device driver that does nothing but
discard the data.  Results are reported as system calls per second.
.PP
It is important to note that the system call chosen actually does the
work on all systems, to the best of my knowledge.  There are some
systems that optimized trivial system calls, such as \f(CBgetpid\fP(),
to return the answer without a true entry into the OS proper.  Writing
to \f(CB/dev/null\fP has not been optimized.
.NH 2
Pipe latency
.LP
This benchmark measures the OS; there is almost no code executed at
user level.  The benchmark measures the round trip time of a small message
being passed back and forth between two processes through a pair of
Unix pipes.
.NH 2
TCP/IP latency
.LP
This benchmark measures the OS
networking code and the driver code; there is almost no code executed at
user level.  The benchmark measures the round trip time of a small message
being passed back and forth between two processes through an AF_INET
socket.  Note that both remote and local results may be reported.
.NH 2
UDP/IP latency
.LP
This benchmark measures the OS
networking code and the driver code; there is almost no code executed at
user level.  The benchmark measures the round trip time of a small message
being passed back and forth between two processes through an AF_INET socket.
Note that both remote
and local results may be reported.
.LP
It is interesting to note that the TCP performance is sometimes
greater than the UDP performance.  
This is contrary to expectations since
the TCP protocol is a reliable, connection oriented protocol, and as such
is expected to carry more overhead.
Why this is so is an exercise left to the
reader.
.NH 2
RPC latency (TCP and UDP)
.LP
Actually two latency benchmarks: Sun RPC over TCP/IP and over UDP/IP.
This benchmark consists of the user level RPC code layered over the TCP
or UDP sockets.  The benchmark measures the round trip time of a small
message being passed back and forth between two processes.  Note that
both remote and local results may be reported.
.LP
Using the TCP or the UDP benchmarks as a baseline, it 
is possible to see how much the RPC code is costing.  
.NH 2
TCP/IP connect latency
.LP
This benchmarks measures the time it takes to get a TCP/IP socket and
connect it to a remote server.
.NH 2
File system latency
.LP
A benchmark that measures how fast the file system can do basic, common 
operations, such as creates and deletes of small files.  
.NH 2
Page fault latency
.LP
A benchmark that measures how fast the file system can pagefault in a
page that is not in memory.
.NH 2
Disk latency
.LP
A benchmark that is designed to measure the overhead of a disk
operation.  Results are reported as operations per second.
.PP
The benchmark is designed with SCSI disks in mind.  It actually simulates
a large number of disks in the following way.  The benchmark reads 512 byte
chunks sequentially from the raw disk device (raw disks are unbuffered
and are not read ahead by Unix).  The benchmark ``knows'' that most
disks have read ahead buffers that read ahead the next 32-128 kilobytes.
Furthermore, the benchmark ``knows'' that the disks rotate and read ahead
faster than the processor can request the chunks of data.\**
.FS
This may not always be true - a processor could be fast enough to make the
requests faster than the rotating disk.  If we take 3MB/sec to be disk
speed, a fair speed, and divide that by 512, that is 6144 IOs/second, or
163 microseconds per IO.  I don't know of any processor/OS/io controller
combinations that can do an
IO in 163 microseconds.
.FE
So the benchmark is basically reading small chunks of data from the
disks track buffer.  Another way to look at this is that the benchmark 
is doing memory to memory transfers across a SCSI channel.
.PP
No matter how you look at it, the resulting number represents a 
\fBlower\fP bound on the overhead of a disk I/O.  In point of fact,
the real numbers will be higher on SCSI systems.  Most SCSI controllers
will not disconnect if the request can be satisfied immediately; that is
the case here.  In practice, the overhead numbers will be higher because
the processor will send the request, disconnect, get interrupted,
reconnect, and transfer.
.PP
It is possible to generate loads of upwards of 500 IOPs on a single
SCSI disk using this technique.  It is useful to do that to figure out
how many drives could be supported on a system before there are no
more processor cycles to handle the load.  Using this trick, you 
do not have to hook up 30 drives, you simulate them.
.NH 2
Memory read latency
.LP
This is perhaps the most interesting benchmark in the suite.  The
entire memory hierarchy is measured, including onboard cache latency
and size, external cache latency and size, main memory latency, and TLB
miss latency.  
.PP
The benchmark varies two parameters, array size and array stride.  
For each size, a list of pointers is created for all of the different 
strides.  Then the list is walked like so
.DS
.ft CB
mov  r0,(r0)  # C code: p = *p;
.DE
The time to do about fifty thousand loads (the list wraps) is measured and
reported.  The time reported is pure latency time and may be zero even though
the load instruction does not execute in zero time.  Zero is defined as one
clock cycle; in other words, the time reported is \fBonly\fP memory latency
time, as it does not include the instruction execution time.  It is assumed
that all processors can do a load instruction (not counting stalls) in one
processor cycle.  In other words, if the processor cache load time
is 60 nanoseconds on a 20 nanosecond processor, the load latency reported
would be 40 nanoseconds, the missing 20 seconds is for the load instruction
itself.  Processors that can manage to get the load address out to the 
address pins before the end of the load cycle get some free time in this
benchmark (I don't think any processors can do that).
.PP
Note that this benchmark has been validated by logic analyzer measurements
on an SGI indy. The
clever reader might realize that last few nanoseconds of inaccuracy could be
rounded off by realizing that the latency is always going to be a multiple
of the processor clock rate.
.PP
The raw data is a series of data sets.  Each data set is a stride size,
with array size varied from about one kilobyte up to eight megabytes.
When these data sets are all plotted together (using a log base 2 scale
for the size variable), the data will be seen to contain a series of 
horizontal plateaus.  The first is the onboard data cache latency (if there
is an onboard cache).  The point where the lines start to go up marks the
size of the cache.  The second is the external cache, the third is the
main memory, and the last is main memory plus TLB miss cost.  In addition
to this information, the cache line size can be derived by noticing which
strides are faster than main memory times.  The first stride that is
main memory speed is likely to be the cache line size.  The reason is
that the strides that are faster than memory indicate that the benchmark is
getting more than one hit per cache line.  Note that prefetching may confuse
you.
.PP
The graph below shows a particularly nicely made machine, a DEC alpha.
This machine is nice because (a) it shows the latencies and sizes of
the on chip level 1 and motherboard level 2 caches, and (b) because it
has the best all around numbers, especially considering it can support a
4MB level 2 cache.  Nice work, DEC.
.so mem.pic
.NH 1
Bandwidth measurements
.LP
One of my former managers\** once noted that ``Unix is Swahili for bcopy().''
I believe that he was indicating his belief that the operating system spent
most of its time moving data from one place to another, via various means.
I tend to agree and have measured the various ways that data can be moved.
The ways that are measured are: through pipes, TCP sockets, library bcopy()
and hand unrolled bcopy(), the read() interface, through the mmap() interface,
and direct memory read and write (no copying).
.FS
Ken Okin
.FE
.NH 2
Pipe bandwidth
.LP
Bandwidth measurement between two local processes communicating through
a Unix pipe.  Results are in megabytes per second.
.NH 2
TCP/IP socket bandwidth
.LP
Bandwidth measurement using TCP/IP sockets.  Results are reported in megabytes
per second.  
Results are reported for local, ethernet, FDDI, and ATM, where possible.  
Results range from 1-10+ megabytes per second.  Any system delivering 
more than 10 MB/second over TCP is doing very well by 1994 standards.
.PP
Note that for local measurements, the system is actually moving 
twice as much data, since the data is being moved to/from the same host.
.PP
Local bandwidths are (sometimes) useful for determining the overhead of the
protocol stack (as well as other OS tasks, such as context switching).  
Note, however, that some implementations (such as Solaris 2.x) have 
``fast pathed'' loopback IP which skews the results.  The fast path
uses a larger MTU and does not do checksums.
.PP
The sockets are configured to use the largest receive/send buffers that the OS
will allow.  This is done to allow maximum bandwidth.  Sun's 4.x TCP/IP
subsystem (and probably BSD's as well) default to 4KB send/receive buffers,
which is too small.  (It would be better if the OS noted that this was a
high volume / high bandwidth connection and automatically grew the buffers.
Hint, hint.)
.NH 2
bcopy bandwidths
.LP
A simple benchmark that measures how fast data can be copied.  A hand
unrolled version and the C library version are tested.  Results are
reported in megabytes per second.  Note that a typical system is actually
moving about three times as much memory as the reported result.  A copy
is actually a read, a write which causes a cache line read, and a write
back.
.NH 2
Read bandwidth
.LP
Most VM system cache file pages for reuse.  This benchmark measures the
speed at which those pages can be reused.  It is important to notice
that this is not a disk read measurement, it is a memory read measurement.
Results are reported in megabytes per second.
.NH 2
Mmap read bandwidth
.LP
The same measurement as the previous benchmark except that it maps the
file, avoiding the copy from kernel to user buffer.  
Results are reported in megabytes per second.
.NH 2
Memory read bandwidth
.LP
A large array is repeatedly read sequentially.
Results reported in megabytes per second.
.NH 2
Memory write bandwidth
.LP
A large array is repeatedly written sequentially.
Results reported in megabytes per second.
.NH 1
Other measurements
.LP
.NH 2
Processor cycle time
mhz
.LP
Calculates the megahertz and clock speed of the processor.  This is the
standard loop in which a series of interlocked operations are timed,
and then the megahertz is derived from the timing.  The operations 
are purposefully interlocked to overcome any super scalerness of the
system under test.
.PP
There are actually three versions of mhz, a generic one that works on
most systems, and two specific versions for SuperSPARC and rs6000
systems.
.PP
It turns out that the
SuperSPARC processor has two ALU's that are run at twice the clock rate,
allowing two interlocked operations to complete in one processor clock.\**
.FS
Credit and thanks to John Mashey of SGI/MIPS fame, who kindly took the
time to out why the benchmark wasn't working on SuperSPARC
systems.  He explained the SuperSPARC pipeline and the solution to the
problem.
.FE
Fortunately, the ALU's are asymmetric and can not do two shifts in
one processor clock.  Shifts are used on SuperSPARC systems.
.PP
IBM rs6000 systems have a C compiler that does not honor the
``register'' directive in unoptimized code.  The IBM loop looks
like it is doing half as many instructions as the others.  This
is on purpose, each add on the IBM is actually two instructions
(I think it is a load/add/store or something like that).
.NH 1
Acknowledgments
.LP
I would like to acknowledge Sun Microsystems for supporting the development
of this project.  In particular,  my personal thanks to Paul Borrill,
Director of the Architecture and Performance group, for conceiving and
supporting the development of these benchmarks.
.PP
My thanks to John Mashey and Neal Nuckolls of Silicon Graphics for reviews,
comments, and explanations of the more obscure problems.
.PP
My thanks to Satya Nishtala of Sun Microsystems for (a) listening to me
complain about memory latencies over and over, (b) doing something about
it in future SPARC systems, and (c) reviewing the memory latency results
and explained IBM's sub blocking scheme (I still don't really understand
it but he does.  Ask him).
.NH 1
Obtaining the benchmarks
.LP
The benchmarks will be posted to the Usenet comp.benchmarks group.  In addition,
mail sent to \f(CBarchives@slovax.engr.sgi.com\fP with a request for 
\f(CBlmbench.shar\fP
sources will get the latest and greatest.
